Learning Wrappers Efficiently for Web Information Extraction Using Unlabeled Examples
Not recorded
Abstract
In this paper, we describe techniques for learning wrappers efficiently using very few user-supplied labels (typically 1 or 2 labels, all within a single page). This is an improvement over previous work, which requires multiple labeled examples on multiple pages. In effect, it brings the power of the wrapper down to the level of the end user, who can teach, with only a few demonstrations, the labels that the wrapper should learn to extract. In contrast to other techniques, our approach also uses unlabeled web pages to guide the selection of appropriate features for the wrapper. We propose techniques to acquire these unlabeled web pages automatically, without the need for the user to supply them.
Similar resources
A Fuzzy Approach for Pertinent Information Extraction from Web Resources
Recent work in machine learning for information extraction has focused on two distinct sub-problems: the conventional problem of filling template slots from natural language text, and the problem of wrapper induction, learning simple extraction procedures (“wrappers”) for highly structured text such as Web pages. For suitable regular domains, existing wrapper induction algorithms can efficientl...
Data Extraction using Content-Based Handles
In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...
Wrapper Induction: Learning (k,l)-Contextual Tree Languages Directly as Unranked Tree Automata
A (k, l)-contextual tree language can be learned from positive examples only; such languages have been successfully used as wrappers for information extraction from web pages. This paper shows how to represent the wrapper as an unranked tree automaton and how to construct it directly from the examples instead of using the (k, l)-forks of the examples. The former speeds up the extraction, the la...
Gleaning answers from the web
A wide variety of valuable textual information resides on the Web, but very little is in a machine-understandable form such as XML. Instead, the content is usually embedded in HTML markup or other encodings designed for human consumption. The information extraction task is to automatically populate a database with content gleaned from information sources such as Web pages. Wrappers are an import...
Learning (k, l)-Contextual Tree Languages for Information Extraction
Learning regular languages from positive examples only is known to be infeasible. A common solution is to define a learnable subclass of the regular languages. In the past, this has been done for regular string languages. Using ideas from those techniques, we define a learnable subclass of regular unranked tree languages, called the (k,l)-contextual tree languages. We describe the use of this s...